{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 贝叶斯分类实现文本分类\n",
    "\n",
    "贝叶斯分类器的主要思想：\n",
    "\n",
    "        已知一个含有标记的数据集，此时来了一个测试样本，我们知道测试样本的特征，需要预测标记，\n",
    "        若我们能够求出这个样本属于各个类别的概率，那么从中选择概率最大的就可以了，那么就是求P(c|x)，\n",
    "        先用全概率公式P(c|x)=P(x,c)/P(x)，再用条件概率公式P(c|x)=P(x,c)/P(x)=P(x|c)*P(c)/P(x)，\n",
    "        对于同一个测试样本P(x)都是相同的，因此分母不是我们需要关心的。P(c)很好求，就是在数据集当中某个类别出现的概率\n",
    "        最难求的就是P(x|c)，朴素贝叶斯的思想就是假设各个特征之间相互独立，那么P(x|c)就等于P(x1|c)*P(x2|c)...，这样就可以求解了\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.datasets import fetch_20newsgroups\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.decomposition import PCA\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.naive_bayes import MultinomialNB"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def dataLoad(dataset):\n",
    "    # 加载文本数据集\n",
    "    feature=dataset.data\n",
    "    target=dataset.target\n",
    "    tf = TfidfVectorizer()\n",
    "    feature = tf.fit_transform(feature)\n",
    "    feature = feature.toarray()\n",
    "    feature = feature.astype(np.uint8)\n",
    "    # 特征降维\n",
    "    # pca\n",
    "    pca = PCA(n_components=100)\n",
    "    feature = pca.fit_transform(feature)\n",
    "    print(feature.shape)\n",
    "    # 划分数据集\n",
    "    x_train, x_test, y_train, y_test = train_test_split(feature, target, test_size=0.25)\n",
    "    x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.25)\n",
    "    return x_train, x_val,x_test, y_train, y_test,y_val "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def bayes():\n",
    "    \n",
    "    x_train, x_val,x_test, y_train, y_test,y_val = dataLoad(fetch_20newsgroups())\n",
    "    \n",
    "   \n",
    "    mlt = MultinomialNB(alpha=1.0)           # 建立贝叶斯模型,alapha是拉普拉斯平滑系数，防止计算的概率是0\n",
    "    # 训练\n",
    "    mlt.fit(x_train, y_train)\n",
    "    # 验证\n",
    "    score_val = mlt.score(x_val, y_val)\n",
    "    print(\"在验证集上的得分：\", score_val)\n",
    "    # 预测\n",
    "    score_test = mlt.score(x_test, y_test)\n",
    "    print(\"在测试集上的得分：\", score_test)\n",
    "    predict = mlt.predict(x_test)\n",
    "    print(\"测试结果：\", predict)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Downloading 20news dataset. This may take a few minutes.\n",
      "Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)\n"
     ]
    }
   ],
   "source": [
    "if __name__ == \"__main__\":\n",
    "    x_train, x_val,x_test, y_train, y_test,y_val = dataLoad(fetch_20newsgroups())\n",
    "    #bayes()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
